The purpose of the reserach is the analysis of data presenting number of murders across Poland between 1999 and 2020 (included), broken down by provinces and compared with data related to population and area of the provinces in 2020. The data is extracted from two sources: Police Statistics and Central Statistical Office.
The research is composed of two parts. Part 1 is focused on murder crime trend analysis over the years by provinces, and leads to the recognition of frequency and volume of crimes within particular locations. On top of that, observation dedicated to analyzing crime intensity in time and identification of periods with the highest and lowest rate levels. Additional metric taken under considaration is the general detection rate level and its values over the years. The last component covers comparison between 2019 and 2020, so that key differences can presented and examined. Part 2 enables widening the perspective by adding data about population and provinces areas and exploring correlation by identyfing coefficient relationship between variables. This part covers oly 2020 for all data included in the correlation test. Basing on the results, the relationdhip between key variables is visualised.
Definitions applied in analysis:
- administrative unit: provinces of Poland, e.g. Silesia. - crime registered: number of murders within particular province; followed with official procedures by investigative authorities. - crime detected: number of crimes (among registered) where at least one suspect is appointed. - detection rate: number of crimes detected divided by crimes registered and multiplied by 100, for percantage calculation.
Questions to answer:
Conclusions:
Crime level of the incidents classified as murders decreases with each year. When looking at the provinces with the geatest volume of murders, four of them focus attention. These are: Masovia, Silesian, Lower Silesian, Lodzkie. The first three of them represent the majority of overall share between 1999-2020, where the exact percentage rates are: Masovia (15,8%), Silesian (12,9%), Lower Silesian (9,9%). On the other hand, the lowest records are characteristic for Opole (2%), Subcarpathian (3%), Świętokrzyskie (3%), Podlaskie (3%). Interesting fact: Masovia notes the highest number of murders every year. Unlike Opolskie, which stands for the lowest results consistently.
Questions [1] and [2]Described results are not equal to high or low level of crime generally. Each case needs to be considered individually, due to local characteristics related to population and area. As the main indicator, the crime rate metric is analzyed as even when the number of crimes itself is lower, the crime rate may be higher once comparing different locations.
Detection of crimes grows significantly while number of crimes decreases. Within last 22 years, the general detection rate increases by ar.12 percentage points, from 86% to 98% countrywide. The highest records are noted in 2000 and 2001 (accordingly: 1270, 1325). On the contrary, in 2016 notes the lowest record of 456 murders. This also confirm the overall decline that happens over the years.
Questions [3] and [4]Nevertheless, there is remarkable growth in the number of murders in 2020 once comparing to 2019. In some locations the number is twice as big. As example, it is worth mentioning Masovia (+109%), Lubusz (+100%), Kuyavian-Pomeranian (+80%). The opposite situation is observed in Subcarpathian where quite spectacular decline is noted - by 50% in 2020 vs 2019. This phenomenon can be dictated by pandemic and strict rules of health control and public safety followed around the world, what led to minimized crime. In order to confirm such hypothesis, more complex and deepen analysis would be required in order to include additional factors and data volume.
Question [5]In Part 2 of the analytical work, enriched with two variables: population and area of each province in 2020, there are some relationships identified. Facotr like big population (corr. 0.8) and population density (corr 0.52) seem to present positive relationship with the number of murders. The provinces with the highest number are Masovia, Silesian. Meanwhile, the ones with the highest crime density (number of crimes per resident) are Lubusz, Lower Silesian, West Pomerania and these are classified as more dangerous.
Question [6]Second interesting thing is quite negative correlation between crime detection rate and population (-0.37); similarly with the area of the province (-0.32). The bigger the population is, the more challenging it is to identify the suspects. Within provinces with bigger area, there is also a better way to hide or escape. Next to this, there are also provinces with high detection rate as a rule, represented by: Kuyavian-Pomeranian, Opole, Lubusz. These are locations with significantly small area and higher crime density. These factors can also contribute to higher detection than in other locations on average.
Question [6]# Importing needed libraries in order to make use of them in the further analysis and visualization.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
# Openning my data file from the local catalogue.
data = pd.read_csv('C:/Users/dagny/Desktop/Data & Analytics/Python/crime-murder-1999-2020.csv', sep=',')
# Checking data types in order to verify whether no update is required.
data.dtypes
administrative_unit object year int64 crime_registered int64 crime_detected int64 dtype: object
# Changing data type for column 'year' as it is not needed for calculations, meanwhile required as a string value.
data['year'] = data['year'].astype('object')
data.dtypes
administrative_unit object year object crime_registered int64 crime_detected int64 dtype: object
# Checking table structure by viewing rows and columns number.
data.shape
(396, 4)
# Analyzing data structure by calling the columns names. Additionally, previewing several first rows.
data.columns
Index(['administrative_unit', 'year', 'crime_registered', 'crime_detected'], dtype='object')
data.head()
| administrative_unit | year | crime_registered | crime_detected | |
|---|---|---|---|---|
| 0 | Lower Silesian | 2020 | 69 | 68 |
| 1 | Lower Silesian | 2019 | 33 | 33 |
| 2 | Lower Silesian | 2018 | 59 | 55 |
| 3 | Lower Silesian | 2017 | 52 | 51 |
| 4 | Lower Silesian | 2016 | 53 | 53 |
# Displaying maximum, minimum and average values; verifying whether records count is the same per columns.
data.describe()
| crime_registered | crime_detected | |
|---|---|---|
| count | 396.000000 | 396.000000 |
| mean | 49.959596 | 45.926768 |
| std | 35.763901 | 30.785862 |
| min | 4.000000 | 5.000000 |
| 25% | 25.750000 | 24.000000 |
| 50% | 41.000000 | 39.000000 |
| 75% | 64.250000 | 59.000000 |
| max | 233.000000 | 179.000000 |
# Checking for null values in order to clean / update / replace / delete.
data.isnull().values.any()
False
# Displaying unique values from the first column (administrative_unit) in order to see all of the locations.
data['administrative_unit'].unique()
array(['Lower Silesian', 'Kuyavian-Pomeranian', 'Lublin', 'Lubusz',
'Lodzkie', 'Lesser Poland',
'Masovia (KWP z/s w Radomiu i KSP Warszawa)', 'Opole',
'Subcarpathian', 'Podlaskie', 'Pomerania', 'Silesian',
'Świętokrzyskie', 'Warmian-Masurian', 'Greater Poland',
'West Pomerania', 'teren działania KWP z/s w Radomiu',
'teren działania KSP Warszawa'], dtype=object)
# Updating naming of the Masovia province so that naming is kept consistent with the rest from the set.
data.loc[data['administrative_unit'] == 'Masovia (KWP z/s w Radomiu i KSP Warszawa)', 'administrative_unit'] = 'Masovia'
# Additionally, I eliminate data from two locations ('teren działania KWP z/s w Radomiu' and 'teren działania KSP Warszawa').
# These are already included in summarized data for Masovia, no need to duplicate it.
# Saving the data within a new file so that I can keep the set locally.
data = data[~data['administrative_unit'].isin(['teren działania KWP z/s w Radomiu', 'teren działania KSP Warszawa'])]
data.to_csv('C:/Users/dagny/Desktop/Data & Analytics/Python/crime-murder-UPDATED.csv', index = False)
data['administrative_unit'].unique()
array(['Lower Silesian', 'Kuyavian-Pomeranian', 'Lublin', 'Lubusz',
'Lodzkie', 'Lesser Poland', 'Masovia', 'Opole', 'Subcarpathian',
'Podlaskie', 'Pomerania', 'Silesian', 'Świętokrzyskie',
'Warmian-Masurian', 'Greater Poland', 'West Pomerania'],
dtype=object)
# Adding column with crime detection calculation (percentage) and adjusting decimal points.
# Updating the source file (UPDATED version) and checking its structure afterwards.
data['detection_rate'] = (data['crime_detected']/data['crime_registered']*100).round()
data.to_csv('C:/Users/dagny/Desktop/Data & Analytics/Python/crime-murder-UPDATED.csv', index=False)
data.head()
| administrative_unit | year | crime_registered | crime_detected | detection_rate | |
|---|---|---|---|---|---|
| 0 | Lower Silesian | 2020 | 69 | 68 | 99.0 |
| 1 | Lower Silesian | 2019 | 33 | 33 | 100.0 |
| 2 | Lower Silesian | 2018 | 59 | 55 | 93.0 |
| 3 | Lower Silesian | 2017 | 52 | 51 | 98.0 |
| 4 | Lower Silesian | 2016 | 53 | 53 | 100.0 |
# Checking data types after the updates. The newly added one is used later in calculations.
data.dtypes
administrative_unit object year object crime_registered int64 crime_detected int64 detection_rate float64 dtype: object
# Checking data for different provinces, for overall preview of the details.
data.loc[data['administrative_unit'] == 'Lower Silesian']
| administrative_unit | year | crime_registered | crime_detected | detection_rate | |
|---|---|---|---|---|---|
| 0 | Lower Silesian | 2020 | 69 | 68 | 99.0 |
| 1 | Lower Silesian | 2019 | 33 | 33 | 100.0 |
| 2 | Lower Silesian | 2018 | 59 | 55 | 93.0 |
| 3 | Lower Silesian | 2017 | 52 | 51 | 98.0 |
| 4 | Lower Silesian | 2016 | 53 | 53 | 100.0 |
| 5 | Lower Silesian | 2015 | 51 | 46 | 90.0 |
| 6 | Lower Silesian | 2014 | 41 | 40 | 98.0 |
| 7 | Lower Silesian | 2013 | 70 | 69 | 99.0 |
| 8 | Lower Silesian | 2012 | 62 | 58 | 94.0 |
| 9 | Lower Silesian | 2011 | 61 | 58 | 95.0 |
| 10 | Lower Silesian | 2010 | 63 | 61 | 97.0 |
| 11 | Lower Silesian | 2009 | 82 | 83 | 101.0 |
| 12 | Lower Silesian | 2008 | 88 | 86 | 98.0 |
| 13 | Lower Silesian | 2007 | 94 | 94 | 100.0 |
| 14 | Lower Silesian | 2006 | 84 | 84 | 100.0 |
| 15 | Lower Silesian | 2005 | 70 | 64 | 91.0 |
| 16 | Lower Silesian | 2004 | 107 | 105 | 98.0 |
| 17 | Lower Silesian | 2003 | 106 | 101 | 95.0 |
| 18 | Lower Silesian | 2002 | 119 | 109 | 92.0 |
| 19 | Lower Silesian | 2001 | 125 | 114 | 91.0 |
| 20 | Lower Silesian | 2000 | 134 | 123 | 92.0 |
| 21 | Lower Silesian | 1999 | 101 | 87 | 86.0 |
# Some records are incorrect as detection_rate column includes values greater than 100.0, what is impossible.
# The number of crime_detected is higher than crime_registered in these cases. This requires switch of the vaues in columns.
data.loc[data['detection_rate'] > 100.00].count()
administrative_unit 17 year 17 crime_registered 17 crime_detected 17 detection_rate 17 dtype: int64
data.loc[data['detection_rate'] > 100.00]
| administrative_unit | year | crime_registered | crime_detected | detection_rate | |
|---|---|---|---|---|---|
| 11 | Lower Silesian | 2009 | 82 | 83 | 101.0 |
| 34 | Kuyavian-Pomeranian | 2008 | 26 | 27 | 104.0 |
| 60 | Lublin | 2004 | 51 | 52 | 102.0 |
| 174 | Opole | 2000 | 23 | 24 | 104.0 |
| 179 | Subcarpathian | 2017 | 14 | 15 | 107.0 |
| 213 | Podlaskie | 2005 | 25 | 26 | 104.0 |
| 225 | Pomerania | 2015 | 28 | 29 | 104.0 |
| 231 | Pomerania | 2009 | 32 | 33 | 103.0 |
| 242 | Silesian | 2020 | 82 | 83 | 101.0 |
| 251 | Silesian | 2011 | 86 | 88 | 102.0 |
| 255 | Silesian | 2007 | 101 | 103 | 102.0 |
| 270 | Świętokrzyskie | 2014 | 4 | 5 | 125.0 |
| 273 | Świętokrzyskie | 2011 | 29 | 30 | 103.0 |
| 296 | Warmian-Masurian | 2010 | 25 | 26 | 104.0 |
| 302 | Warmian-Masurian | 2004 | 36 | 37 | 103.0 |
| 316 | Greater Poland | 2012 | 39 | 40 | 103.0 |
| 338 | West Pomerania | 2012 | 44 | 45 | 102.0 |
# Switching the values within identified rows.
data[['crime_registered','crime_detected']] =\
data[['crime_detected','crime_registered']].where(data['detection_rate'] > 100.00,
data[['crime_registered','crime_detected']].values)
# Repeating the calculation for detection_rate with proper decimal points definition.
data['detection_rate'] = (data['crime_detected']/data['crime_registered']*100).round()
data.head()
| administrative_unit | year | crime_registered | crime_detected | detection_rate | |
|---|---|---|---|---|---|
| 0 | Lower Silesian | 2020 | 69 | 68 | 99.0 |
| 1 | Lower Silesian | 2019 | 33 | 33 | 100.0 |
| 2 | Lower Silesian | 2018 | 59 | 55 | 93.0 |
| 3 | Lower Silesian | 2017 | 52 | 51 | 98.0 |
| 4 | Lower Silesian | 2016 | 53 | 53 | 100.0 |
# Checking on random record to verify the correctness.
data[11:12]
| administrative_unit | year | crime_registered | crime_detected | detection_rate | |
|---|---|---|---|---|---|
| 11 | Lower Silesian | 2009 | 83 | 82 | 99.0 |
# Checking all records in order to see whether there is still anything incorrect.
data.loc[data['detection_rate'] > 100.00].count()
administrative_unit 0 year 0 crime_registered 0 crime_detected 0 detection_rate 0 dtype: int64
# Grouping data by province and year. 16 provinces expected with 22 records from 1999-2020.
data_groupby_province = data.groupby('administrative_unit')
data_groupby_province.ngroups
16
data_groupby_province.size()
administrative_unit Greater Poland 22 Kuyavian-Pomeranian 22 Lesser Poland 22 Lodzkie 22 Lower Silesian 22 Lublin 22 Lubusz 22 Masovia 22 Opole 22 Podlaskie 22 Pomerania 22 Silesian 22 Subcarpathian 22 Warmian-Masurian 22 West Pomerania 22 Świętokrzyskie 22 dtype: int64
data_groupby_year = data.groupby('year')
data_groupby_year.ngroups
22
data_groupby_year.size()
year 1999 16 2000 16 2001 16 2002 16 2003 16 2004 16 2005 16 2006 16 2007 16 2008 16 2009 16 2010 16 2011 16 2012 16 2013 16 2014 16 2015 16 2016 16 2017 16 2018 16 2019 16 2020 16 dtype: int64
# CHART: 1.1.
# Creating visual chart showing number of murders per each province and year (1999-2020).
# The general idea is to show overall trend over time and identify differences between locations.
# Adjustment of the chart details include: creating legend, descriptions, margins, overall look and feel.
plt.style.use('fivethirtyeight')
provinces = np.unique(data['administrative_unit'])
colors = [plt.cm.viridis(i/float(len(provinces)-1)) for i in range(len(provinces))]
plt.figure(figsize = (22, 14))
for i, administrative_unit in enumerate(provinces):
plt.scatter('year', 'crime_registered',
data = data.loc[data.administrative_unit == administrative_unit, :],
s = 60,
color = colors[i],
label = str(administrative_unit))
for j, administrative_unit in enumerate(provinces):
plt.plot('year', 'crime_registered',
data = data.loc[data.administrative_unit == administrative_unit, :],
linewidth = 1,
color = colors[j],
label = "_nolegend_")
plt.title('Murders by province (1999-2020)',
fontsize = 25,
fontweight = 'bold',
pad = 40)
plt.xlabel('Year', fontsize = 19,
labelpad = 40,
fontweight = 'bold')
plt.ylabel('Murders',
fontsize = 19,
labelpad = 40,
fontweight = 'bold')
plt.legend(fontsize = 20)
plt.xticks(list(data.year.unique()),
fontsize = 19)
plt.yticks(fontsize = 19)
plt.tight_layout()
plt.margins(0.1)
plt.show()
# Summarizing data regarding the number of murders broken down by provinces.
# Creating separate dataframe in order ot use it in further analyses.
summary_data = data.groupby('administrative_unit', as_index = False)\
.aggregate({'crime_registered': 'sum', 'crime_detected': 'sum'})
summary_data
| administrative_unit | crime_registered | crime_detected | |
|---|---|---|---|
| 0 | Greater Poland | 960 | 884 |
| 1 | Kuyavian-Pomeranian | 645 | 618 |
| 2 | Lesser Poland | 996 | 917 |
| 3 | Lodzkie | 1498 | 1428 |
| 4 | Lower Silesian | 1725 | 1641 |
| 5 | Lublin | 995 | 938 |
| 6 | Lubusz | 655 | 615 |
| 7 | Masovia | 2728 | 2323 |
| 8 | Opole | 319 | 304 |
| 9 | Podlaskie | 548 | 519 |
| 10 | Pomerania | 1009 | 933 |
| 11 | Silesian | 2198 | 2107 |
| 12 | Subcarpathian | 465 | 431 |
| 13 | Warmian-Masurian | 748 | 711 |
| 14 | West Pomerania | 1095 | 1023 |
| 15 | Świętokrzyskie | 491 | 453 |
# CHART 1.2.
# Generating visual representation of the summary of the murders within 1999-2020, per province.
plt.style.use('fivethirtyeight')
color = plt.cm.winter(np.linspace(0, 1, 10))
data.groupby('administrative_unit')\
.crime_registered.sum()\
.plot.bar(color = color,
figsize = (20, 11))
plt.title('Sum of murders per province (1999-2020)',
fontsize = 24,
fontweight = 'bold',
pad = 40)
plt.xlabel('Province',
fontsize = 18,
labelpad = 30,
fontweight = 'bold')
plt.ylabel('Sum of murders',
fontsize = 18,
labelpad = 30,
fontweight = 'bold')
plt.xticks(rotation=90, fontsize=16)
plt.yticks(fontsize=16)
plt.tight_layout()
plt.margins(0.1)
plt.show()
# Calcultaing the crime share rate during 1999-2020 per each province.
# Including these values in the sumamry_data as a seperate column / variable (share).
summary_Poland = sum(data['crime_registered'])
share = (summary_data['crime_registered'] / summary_Poland)*100
share.round()
0 6.0 1 4.0 2 6.0 3 9.0 4 10.0 5 6.0 6 4.0 7 16.0 8 2.0 9 3.0 10 6.0 11 13.0 12 3.0 13 4.0 14 6.0 15 3.0 Name: crime_registered, dtype: float64
summary_data['share'] = share.round()
summary_data
| administrative_unit | crime_registered | crime_detected | share | |
|---|---|---|---|---|
| 0 | Greater Poland | 960 | 884 | 6.0 |
| 1 | Kuyavian-Pomeranian | 645 | 618 | 4.0 |
| 2 | Lesser Poland | 996 | 917 | 6.0 |
| 3 | Lodzkie | 1498 | 1428 | 9.0 |
| 4 | Lower Silesian | 1725 | 1641 | 10.0 |
| 5 | Lublin | 995 | 938 | 6.0 |
| 6 | Lubusz | 655 | 615 | 4.0 |
| 7 | Masovia | 2728 | 2323 | 16.0 |
| 8 | Opole | 319 | 304 | 2.0 |
| 9 | Podlaskie | 548 | 519 | 3.0 |
| 10 | Pomerania | 1009 | 933 | 6.0 |
| 11 | Silesian | 2198 | 2107 | 13.0 |
| 12 | Subcarpathian | 465 | 431 | 3.0 |
| 13 | Warmian-Masurian | 748 | 711 | 4.0 |
| 14 | West Pomerania | 1095 | 1023 | 6.0 |
| 15 | Świętokrzyskie | 491 | 453 | 3.0 |
# CHART 2.1.
# Creatign visual representation as a pie chart showing share of each of the provinces in the overall number of murders.
# Thre of the provinces are distinguished: Masovia, Silesian, Lower Silesian.
plt.figure(figsize = (17,13))
labels = summary_data['administrative_unit']
sizes = summary_data['share']
explode = (0.1, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0)
color = plt.cm.winter(np.linspace(0, 2, 10))
plt.pie(sizes,
labels = labels,
explode = explode,
colors = color,
autopct = '%0.1f%%',
textprops = dict(fontsize = 14),
wedgeprops = {'linewidth': 2.0,
'edgecolor' : 'white'})
plt.title('Share rate of the provinces in murders overall (1999-2020)',
fontsize = 22,
fontweight = 'bold',
pad = 10)
plt.axis('equal')
plt.tight_layout()
plt.margins(0.1)
plt.show()
# Checking results of the highest and the lowest records each year.
# In order to observe trends per locations, choosing sample of 5 rows per category.
# Generating matrix view showing number of murders in each province and each year.
records = data.loc[data.groupby('year')['crime_registered'].nlargest(1).index.get_level_values(1)]
minimum = data.loc[data.groupby('year')['crime_registered'].nsmallest(1).index.get_level_values(1)]
records.head(), minimum.head()
( administrative_unit year crime_registered crime_detected \
153 Masovia 1999 172 124
152 Masovia 2000 204 166
151 Masovia 2001 233 172
150 Masovia 2002 182 147
149 Masovia 2003 193 152
detection_rate
153 72.0
152 81.0
151 74.0
150 81.0
149 79.0 ,
administrative_unit year crime_registered crime_detected \
175 Opole 1999 23 21
174 Opole 2000 24 23
173 Opole 2001 21 20
172 Opole 2002 19 19
171 Opole 2003 19 18
detection_rate
175 91.0
174 96.0
173 95.0
172 100.0
171 95.0 )
# MATRIX 3.1.
# Marking with 'gold' the greatest number of murders within each year. Masovia leads in the ranking.
matrix_max = data.groupby(['administrative_unit', 'year'])['crime_registered'].sum().unstack()
matrix_max.style.highlight_max(color = 'gold', axis = 0)
| year | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| administrative_unit | ||||||||||||||||||||||
| Greater Poland | 44 | 53 | 59 | 76 | 81 | 60 | 49 | 42 | 42 | 39 | 47 | 36 | 42 | 40 | 31 | 33 | 23 | 21 | 29 | 36 | 44 | 33 |
| Kuyavian-Pomeranian | 50 | 47 | 48 | 30 | 31 | 25 | 38 | 34 | 34 | 27 | 32 | 29 | 33 | 12 | 22 | 20 | 19 | 23 | 27 | 22 | 15 | 27 |
| Lesser Poland | 45 | 59 | 83 | 71 | 46 | 64 | 44 | 46 | 47 | 49 | 53 | 42 | 40 | 49 | 38 | 42 | 40 | 17 | 23 | 29 | 33 | 36 |
| Lodzkie | 64 | 108 | 119 | 123 | 98 | 98 | 88 | 81 | 87 | 75 | 56 | 64 | 58 | 43 | 51 | 50 | 42 | 28 | 36 | 44 | 41 | 44 |
| Lower Silesian | 101 | 134 | 125 | 119 | 106 | 107 | 70 | 84 | 94 | 88 | 83 | 63 | 61 | 62 | 70 | 41 | 51 | 53 | 52 | 59 | 33 | 69 |
| Lublin | 66 | 75 | 55 | 62 | 47 | 52 | 50 | 47 | 73 | 53 | 62 | 41 | 48 | 29 | 46 | 28 | 18 | 35 | 27 | 34 | 24 | 23 |
| Lubusz | 44 | 62 | 52 | 46 | 34 | 33 | 27 | 24 | 29 | 22 | 27 | 27 | 29 | 20 | 17 | 28 | 29 | 28 | 13 | 25 | 13 | 26 |
| Masovia | 172 | 204 | 233 | 182 | 193 | 147 | 154 | 126 | 155 | 123 | 119 | 106 | 100 | 90 | 70 | 74 | 63 | 68 | 92 | 80 | 70 | 107 |
| Opole | 23 | 24 | 21 | 19 | 19 | 13 | 13 | 16 | 15 | 14 | 14 | 17 | 10 | 17 | 10 | 9 | 15 | 8 | 9 | 12 | 13 | 8 |
| Podlaskie | 37 | 49 | 45 | 62 | 32 | 36 | 26 | 34 | 26 | 18 | 14 | 16 | 14 | 16 | 24 | 14 | 17 | 11 | 15 | 14 | 13 | 15 |
| Pomerania | 70 | 83 | 64 | 52 | 67 | 73 | 44 | 42 | 38 | 46 | 33 | 43 | 40 | 27 | 33 | 38 | 29 | 21 | 46 | 27 | 41 | 52 |
| Silesian | 145 | 157 | 188 | 131 | 116 | 138 | 107 | 119 | 103 | 89 | 99 | 90 | 88 | 78 | 69 | 67 | 77 | 69 | 51 | 63 | 71 | 83 |
| Subcarpathian | 25 | 37 | 38 | 26 | 27 | 19 | 21 | 26 | 12 | 26 | 26 | 14 | 14 | 16 | 28 | 25 | 6 | 15 | 15 | 16 | 22 | 11 |
| Warmian-Masurian | 53 | 52 | 60 | 73 | 53 | 37 | 41 | 44 | 27 | 29 | 40 | 26 | 21 | 26 | 18 | 18 | 19 | 22 | 25 | 17 | 23 | 24 |
| West Pomerania | 77 | 80 | 97 | 78 | 62 | 56 | 42 | 31 | 49 | 39 | 41 | 46 | 37 | 45 | 34 | 35 | 37 | 27 | 38 | 27 | 52 | 65 |
| Świętokrzyskie | 32 | 46 | 38 | 38 | 27 | 24 | 24 | 20 | 19 | 23 | 19 | 21 | 30 | 14 | 13 | 5 | 11 | 10 | 16 | 26 | 16 | 19 |
# MATRIX 3.2.
# Marking with 'lime' the lowest number of murders within each year. Opole detected as the one which stands out.
matrix_min = data.groupby(['administrative_unit', 'year'])['crime_registered'].sum().unstack()
matrix_min.style.highlight_min(color = 'lime', axis = 0)
| year | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| administrative_unit | ||||||||||||||||||||||
| Greater Poland | 44 | 53 | 59 | 76 | 81 | 60 | 49 | 42 | 42 | 39 | 47 | 36 | 42 | 40 | 31 | 33 | 23 | 21 | 29 | 36 | 44 | 33 |
| Kuyavian-Pomeranian | 50 | 47 | 48 | 30 | 31 | 25 | 38 | 34 | 34 | 27 | 32 | 29 | 33 | 12 | 22 | 20 | 19 | 23 | 27 | 22 | 15 | 27 |
| Lesser Poland | 45 | 59 | 83 | 71 | 46 | 64 | 44 | 46 | 47 | 49 | 53 | 42 | 40 | 49 | 38 | 42 | 40 | 17 | 23 | 29 | 33 | 36 |
| Lodzkie | 64 | 108 | 119 | 123 | 98 | 98 | 88 | 81 | 87 | 75 | 56 | 64 | 58 | 43 | 51 | 50 | 42 | 28 | 36 | 44 | 41 | 44 |
| Lower Silesian | 101 | 134 | 125 | 119 | 106 | 107 | 70 | 84 | 94 | 88 | 83 | 63 | 61 | 62 | 70 | 41 | 51 | 53 | 52 | 59 | 33 | 69 |
| Lublin | 66 | 75 | 55 | 62 | 47 | 52 | 50 | 47 | 73 | 53 | 62 | 41 | 48 | 29 | 46 | 28 | 18 | 35 | 27 | 34 | 24 | 23 |
| Lubusz | 44 | 62 | 52 | 46 | 34 | 33 | 27 | 24 | 29 | 22 | 27 | 27 | 29 | 20 | 17 | 28 | 29 | 28 | 13 | 25 | 13 | 26 |
| Masovia | 172 | 204 | 233 | 182 | 193 | 147 | 154 | 126 | 155 | 123 | 119 | 106 | 100 | 90 | 70 | 74 | 63 | 68 | 92 | 80 | 70 | 107 |
| Opole | 23 | 24 | 21 | 19 | 19 | 13 | 13 | 16 | 15 | 14 | 14 | 17 | 10 | 17 | 10 | 9 | 15 | 8 | 9 | 12 | 13 | 8 |
| Podlaskie | 37 | 49 | 45 | 62 | 32 | 36 | 26 | 34 | 26 | 18 | 14 | 16 | 14 | 16 | 24 | 14 | 17 | 11 | 15 | 14 | 13 | 15 |
| Pomerania | 70 | 83 | 64 | 52 | 67 | 73 | 44 | 42 | 38 | 46 | 33 | 43 | 40 | 27 | 33 | 38 | 29 | 21 | 46 | 27 | 41 | 52 |
| Silesian | 145 | 157 | 188 | 131 | 116 | 138 | 107 | 119 | 103 | 89 | 99 | 90 | 88 | 78 | 69 | 67 | 77 | 69 | 51 | 63 | 71 | 83 |
| Subcarpathian | 25 | 37 | 38 | 26 | 27 | 19 | 21 | 26 | 12 | 26 | 26 | 14 | 14 | 16 | 28 | 25 | 6 | 15 | 15 | 16 | 22 | 11 |
| Warmian-Masurian | 53 | 52 | 60 | 73 | 53 | 37 | 41 | 44 | 27 | 29 | 40 | 26 | 21 | 26 | 18 | 18 | 19 | 22 | 25 | 17 | 23 | 24 |
| West Pomerania | 77 | 80 | 97 | 78 | 62 | 56 | 42 | 31 | 49 | 39 | 41 | 46 | 37 | 45 | 34 | 35 | 37 | 27 | 38 | 27 | 52 | 65 |
| Świętokrzyskie | 32 | 46 | 38 | 38 | 27 | 24 | 24 | 20 | 19 | 23 | 19 | 21 | 30 | 14 | 13 | 5 | 11 | 10 | 16 | 26 | 16 | 19 |
---- Back to top ----
# Calculating maximum and minimum values per province to see the most remarkable differences or potential anomalies.
# Merging the ruslts into one dataframe.
min_province = data.groupby('administrative_unit').crime_registered.min()
record_province = data.groupby('administrative_unit').crime_registered.max()
values_max_min = pd.merge(record_province, min_province, on = 'administrative_unit')
values_max_min.columns = ['maximum', 'minimum']
values_max_min
| maximum | minimum | |
|---|---|---|
| administrative_unit | ||
| Greater Poland | 81 | 21 |
| Kuyavian-Pomeranian | 50 | 12 |
| Lesser Poland | 83 | 17 |
| Lodzkie | 123 | 28 |
| Lower Silesian | 134 | 33 |
| Lublin | 75 | 18 |
| Lubusz | 62 | 13 |
| Masovia | 233 | 63 |
| Opole | 24 | 8 |
| Podlaskie | 62 | 11 |
| Pomerania | 83 | 21 |
| Silesian | 188 | 51 |
| Subcarpathian | 38 | 6 |
| Warmian-Masurian | 73 | 17 |
| West Pomerania | 97 | 27 |
| Świętokrzyskie | 46 | 5 |
# CHART 4.1.
# Generating visual representation showing minimum and maximum number of murders per year, within each province.
# Creating two layers, by combining line graph with scatter graph in order to see the overall trend and data points.
plt.style.use('fivethirtyeight')
plt.figure(figsize = (20, 12))
provinces = np.unique(data['administrative_unit'])
plt.scatter(provinces,
record_province,
s = 150,
color = 'navy')
plt.scatter(provinces,
min_province,
s = 150,
color = 'lime')
plt.title('Maximum and minimum number of murders in provinces',
fontsize = 24,
fontweight = 'bold',
pad = 40)
plt.xlabel('Province',
fontsize = 19,
labelpad = 20,
fontweight = 'bold')
plt.ylabel('Number of murders',
fontsize = 19,
labelpad = 40,
fontweight = 'bold')
plt.legend(['Maximum','Minimum'],
bbox_to_anchor = (0.95, 1.03),
fontsize = 19,
frameon = True,
facecolor = 'white',
edgecolor = 'black')
plt.plot(provinces,
record_province,
color = 'navy',
linewidth = 1)
plt.plot(provinces,
min_province,
color = 'lime',
linewidth = 1)
plt.xticks(rotation = 90,
fontsize = 18)
plt.yticks(fontsize = 18)
plt.tight_layout()
plt.margins(0.1)
plt.show()
# TABLE 5.1.
# Veryfying the years with the highest records of murder crimes; marking out with 'gold'.
# Using 5 or more times in the ranking of the highest number per year to see the most intensive periods.
max_year =\
(data.loc[data.groupby('administrative_unit')['crime_registered']
.nlargest(1)
.index.get_level_values(1)])\
.groupby('year')\
.count()
def custom_style(row):
color = 'white'
if row.values[-1] >= 5:
color = 'gold'
return ['background-color: %s' % color]*len(row.values)
max_year.style.apply(custom_style, axis=1)
| administrative_unit | crime_registered | crime_detected | detection_rate | |
|---|---|---|---|---|
| year | ||||
| 1999 | 1 | 1 | 1 | 1 |
| 2000 | 6 | 6 | 6 | 6 |
| 2001 | 5 | 5 | 5 | 5 |
| 2002 | 3 | 3 | 3 | 3 |
| 2003 | 1 | 1 | 1 | 1 |
# TABLE 5.2.
# Veryfying the years with the lowest records of murder crimes; marking out with 'lime'.
# Using 5 or more times in the ranking of the lowest number per year to see the most peaceful periods.
min_year =\
(data.loc[data.groupby('administrative_unit')['crime_registered']
.nsmallest(1)
.index.get_level_values(1)])\
.groupby('year')\
.count()
def custom_style(row):
color = 'white'
if row.values[-1] >= 5:
color = 'chartreuse'
return ['background-color: %s' % color]*len(row.values)
min_year.style.apply(custom_style, axis=1)
| administrative_unit | crime_registered | crime_detected | detection_rate | |
|---|---|---|---|---|
| year | ||||
| 2012 | 1 | 1 | 1 | 1 |
| 2014 | 1 | 1 | 1 | 1 |
| 2015 | 3 | 3 | 3 | 3 |
| 2016 | 5 | 5 | 5 | 5 |
| 2017 | 1 | 1 | 1 | 1 |
| 2018 | 2 | 2 | 2 | 2 |
| 2019 | 2 | 2 | 2 | 2 |
| 2020 | 1 | 1 | 1 | 1 |
# CHART 6.1.
# Generating visual representation of showing minimum and maximum number of murders within each year.
# Creating two layers, by combining line graph with scatter graph in order to see the overall trend and data points.
plt.style.use('fivethirtyeight')
plt.figure(figsize = (20, 12))
year = records['year']
max_reg = records['crime_registered']
min_reg = minimum['crime_registered']
plt.scatter(year,
max_reg,
s = 150,
color = 'navy')
plt.scatter(year,
min_reg,
s= 150,
color = 'lime')
plt.title("aximum and minimum number of murders each year",
fontsize = 26,
fontweight = 'bold',
pad = 40)
plt.xlabel('Year',
fontsize = 19,
labelpad = 40,
fontweight = 'bold')
plt.ylabel('Number of murders',
fontsize = 20,
labelpad = 40,
fontweight = 'bold')
plt.legend(['Maximum','Minimum'],
bbox_to_anchor = (0.93, 1.02),
fontsize = 20,
frameon = True,
facecolor = 'white',
edgecolor = 'black')
plt.plot(year,
max_reg,
color = 'navy',
linewidth = 1)
plt.plot(year,
min_reg,
color = 'lime',
linewidth = 1)
plt.xticks(list(records.year),
fontsize = 18)
plt.yticks(fontsize = 18)
plt.tight_layout()
plt.margins(0.1)
plt.show()
# Summing data about crimes (murders) broken by year.
# Recalculating detection rate per each year and also the percentage of undetected crimes (100 minus detected rate).
# Creating separate dataframe in order to use it further in the analysis.
sum_years = data.groupby('year').aggregate({'crime_registered': 'sum', 'crime_detected': 'sum'})
sum_years['detected'] = ((sum_years['crime_detected'] / sum_years['crime_registered'])*100).round()
sum_years['undetected'] = 100.00 - (sum_years['detected'])
sum_years
| crime_registered | crime_detected | detected | undetected | |
|---|---|---|---|---|
| year | ||||
| 1999 | 1048 | 903 | 86.0 | 14.0 |
| 2000 | 1270 | 1122 | 88.0 | 12.0 |
| 2001 | 1325 | 1171 | 88.0 | 12.0 |
| 2002 | 1188 | 1081 | 91.0 | 9.0 |
| 2003 | 1039 | 934 | 90.0 | 10.0 |
| 2004 | 982 | 911 | 93.0 | 7.0 |
| 2005 | 838 | 770 | 92.0 | 8.0 |
| 2006 | 816 | 762 | 93.0 | 7.0 |
| 2007 | 850 | 813 | 96.0 | 4.0 |
| 2008 | 760 | 719 | 95.0 | 5.0 |
| 2009 | 765 | 719 | 94.0 | 6.0 |
| 2010 | 681 | 633 | 93.0 | 7.0 |
| 2011 | 665 | 634 | 95.0 | 5.0 |
| 2012 | 584 | 555 | 95.0 | 5.0 |
| 2013 | 574 | 549 | 96.0 | 4.0 |
| 2014 | 527 | 506 | 96.0 | 4.0 |
| 2015 | 496 | 470 | 95.0 | 5.0 |
| 2016 | 456 | 440 | 96.0 | 4.0 |
| 2017 | 514 | 500 | 97.0 | 3.0 |
| 2018 | 531 | 511 | 96.0 | 4.0 |
| 2019 | 524 | 512 | 98.0 | 2.0 |
| 2020 | 642 | 630 | 98.0 | 2.0 |
# CHART 7.1.
# Generating visual representation showing the level of crime/murder detection per each year.
# The chart gives clear view of detection rate improving over time, with decrease in the number of murders at the same time.
plt.style.use('fivethirtyeight')
plt.figure(figsize = (22, 12))
sum_reg = sum_years['crime_registered']
sum_det = sum_years['crime_detected']
sum_year = sorted(data.year.unique())
indices = np.arange(len(sum_reg))
plt.bar(indices,
sum_reg,
width = 0.7,
color = 'navy',
label = 'Murders')
plt.bar(indices,
sum_det,
width = 0.1,
color = 'lime',
label = 'Detections')
plt.xticks(indices,
["{}".format(i) for i in (sum_year)],
fontsize = 19)
plt.yticks(fontsize = 19)
plt.title('Suma of detected and undetected murders per year',
fontsize = 26,
fontweight = 'bold',
pad = 40)
plt.xlabel('Year',
fontsize = 22,
labelpad = 40,
fontweight = 'bold')
plt.ylabel('Number of murders',
fontsize = 22,
labelpad = 40,
fontweight = 'bold')
plt.legend(['Murders','Detections'],
bbox_to_anchor = (0.93, 1.02),
fontsize = 20,
frameon = True,
facecolor = 'white',
edgecolor = 'black')
plt.tight_layout()
plt.margins(0.1)
plt.show()
# CHART 7.2.
# Creating bar chart presenting proportion of the detected and undetected murders year after year.
# This confirms the trend of increasing detection rate.
plt.style.use('fivethirtyeight')
sum_years_detection = sum_years[['detected', 'undetected']].round()
ax = sum_years_detection.plot(kind = 'bar',
stacked = True,
width = 0.9,
figsize = (20, 12),
color = ['navy', 'magenta'])
for rect in ax.patches:
height = rect.get_height()
width = rect.get_width()
x = rect.get_x()
y = rect.get_y()
label_text = f'{height}'
label_x = x + width / 2
label_y = y + height / 2
if height > 0:
ax.text(label_x,
label_y,
label_text,
ha = 'center',
va = 'center',
fontsize = 14,
fontweight = 'bold',
color= 'white')
plt.title('Murders detection per year',
fontsize = 24,
fontweight = 'bold',
pad = 40)
plt.xlabel('Year',
fontsize = 20,
labelpad = 40,
fontweight = 'bold')
plt.ylabel('Detection rate %',
fontsize = 20,
labelpad = 40,
fontweight = 'bold')
plt.legend(['Detected','Undetected'],
bbox_to_anchor = (0.85, 0.94),
fontsize = 18,
frameon = True,
facecolor = 'white',
edgecolor = 'black')
plt.xticks(fontsize = 18)
plt.yticks(fontsize = 18)
plt.tight_layout()
plt.margins(0.1)
plt.show()
# CHART 8.1.
# Due to meaningful difference observed between 2019 and 2020, when the volume of crimes increases significantly,
# generating visual represantation showing sum of murders per each province and comparing data from the two periods.
# The sepearate data frame is being created in order to use this in further analysis.
plt.style.use('fivethirtyeight')
color = ('navy', 'lime')
year_2019_2020 = data[data['year'].isin([2019, 2020])]
sns.catplot(x = 'crime_registered',
y = 'administrative_unit',
data = year_2019_2020,
hue = 'year',
kind = 'bar',
height = 11,
aspect = 2,
palette = color,
legend = False)
plt.title('Sum of murders in each province in 2019 and 2020',
fontsize = 26,
fontweight = 'bold',
pad = 40)
plt.xlabel('Number of murders',
fontsize = 20,
labelpad = 30,
fontweight = 'bold')
plt.ylabel('Province',
fontsize = 20,
labelpad = 30,
fontweight = 'bold')
plt.legend(bbox_to_anchor = (0.96, 1.01),
fontsize = 18,
frameon = True,
facecolor = 'white',
edgecolor ='black')
plt.yticks(fontsize = 18)
plt.xticks(fontsize = 18)
plt.tight_layout()
plt.margins(0.1)
plt.show()
# Calculating percentage difference between 2019 and 2020 per each province.
# Pivoting the table and adding the newly calculated column as the new one in the dataframe.
year_2019_2020 = data[data['year'].isin([2019, 2020])]
diff_20_19 = year_2019_2020.pivot_table('crime_registered', ['administrative_unit'], 'year')
diff_20_19['percentage_diff'] = (((diff_20_19[2020] - diff_20_19[2019]) / diff_20_19[2019])*100).round()
diff_20_19
| year | 2019 | 2020 | percentage_diff |
|---|---|---|---|
| administrative_unit | |||
| Greater Poland | 44 | 33 | -25.0 |
| Kuyavian-Pomeranian | 15 | 27 | 80.0 |
| Lesser Poland | 33 | 36 | 9.0 |
| Lodzkie | 41 | 44 | 7.0 |
| Lower Silesian | 33 | 69 | 109.0 |
| Lublin | 24 | 23 | -4.0 |
| Lubusz | 13 | 26 | 100.0 |
| Masovia | 70 | 107 | 53.0 |
| Opole | 13 | 8 | -38.0 |
| Podlaskie | 13 | 15 | 15.0 |
| Pomerania | 41 | 52 | 27.0 |
| Silesian | 71 | 83 | 17.0 |
| Subcarpathian | 22 | 11 | -50.0 |
| Warmian-Masurian | 23 | 24 | 4.0 |
| West Pomerania | 52 | 65 | 25.0 |
| Świętokrzyskie | 16 | 19 | 19.0 |
# CHART 8.2.
# Generating tornado chart presenting difference in percentage in order to clearly pull apart the increases and decreases.
# Sorting the data by descending order to see the biggest incresease in first position.
plt.style.use('fivethirtyeight')
plt.figure(figsize = (20,15))
diff_20_19['colors'] = ['lime' if x < 0 else 'navy' for x in diff_20_19['percentage_diff']]
diff_20_19.sort_values('percentage_diff',
inplace = True)
plt.hlines(y = diff_20_19.index,
xmin = 0,
xmax = diff_20_19.percentage_diff,
color = diff_20_19.colors,
linewidth = 18)
for x, y, tex in zip(diff_20_19.percentage_diff,
diff_20_19.index,
diff_20_19.percentage_diff):
plt.text(x,
y,
round(tex, 2),
horizontalalignment = 'right' if x < 0 else 'left',
verticalalignment = 'center',
fontdict = {'color':'black' if x < 0 else 'navy', 'size':17})
plt.title('Percentage difference in the volume of murders in 2020 vs 2019',
fontsize = 24,
fontweight = 'bold',
pad = 40)
plt.xlabel('Percentage',
fontsize = 19,
labelpad = 30,
fontweight = 'bold')
plt.ylabel('Province',
fontsize = 19,
labelpad = 30,
fontweight = 'bold')
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.tight_layout()
plt.margins(0.1)
plt.show()
# Opening second data file, including population and area of each province in 2020.
# Basing on this portion of data, additional variables are being defined in order to explore correlation with crimes.
data2 = pd.read_csv('C:/Users/dagny/Desktop/Data & Analytics/Python/provinces-population-area-Poland-2020.csv', sep=',')
# Veryfying data types as well as columns and rows structure.
data2.dtypes
province object population int64 area_km2 int64 dtype: object
data2.shape
(16, 3)
# Viewing the data in the datafarme.
data2
| province | population | area_km2 | |
|---|---|---|---|
| 0 | Lower Silesian | 2901225 | 19947 |
| 1 | Kuyavian-Pomeranian | 2077775 | 17972 |
| 2 | Lublin | 2117619 | 25122 |
| 3 | Lubusz | 1014548 | 13988 |
| 4 | Lodzkie | 2466322 | 18219 |
| 5 | Lesser Poland | 3400577 | 15183 |
| 6 | Masovia | 5403412 | 35558 |
| 7 | Opole | 986506 | 9412 |
| 8 | Subcarpathian | 2129015 | 17846 |
| 9 | Podlaskie | 1181533 | 20187 |
| 10 | Pomerania | 2333523 | 18321 |
| 11 | Silesian | 4533565 | 12333 |
| 12 | Świętokrzyskie | 1241546 | 11711 |
| 13 | Warmian-Masurian | 1428983 | 24173 |
| 14 | Greater Poland | 3493969 | 29826 |
| 15 | West Pomerania | 1701030 | 22897 |
# Displaying maximum, minimum and average values; verifying whether records count is the same per columns.
# Generated values are then used within grouping / segments definition.
data2.describe()
| population | area_km2 | |
|---|---|---|
| count | 1.600000e+01 | 16.000000 |
| mean | 2.400697e+06 | 19543.437500 |
| std | 1.277056e+06 | 6836.338883 |
| min | 9.865060e+05 | 9412.000000 |
| 25% | 1.382124e+06 | 14884.250000 |
| 50% | 2.123317e+06 | 18270.000000 |
| 75% | 3.026063e+06 | 23216.000000 |
| max | 5.403412e+06 | 35558.000000 |
# Creating segments according to four categories of provinces population.
# Data is used then to compare population segment with area segment to explore potential relationships.
def seg_population(population):
if population < 1382000:
return 'small'
elif population >= 1382000 and population < 2123000:
return 'medium'
elif population >= 2123000 and population < 3026000:
return 'big'
else:
return 'very big'
seg_population(30)
'small'
data2['population_segment'] = data2['population'].apply(seg_population)
data2.head()
| province | population | area_km2 | population_segment | |
|---|---|---|---|---|
| 0 | Lower Silesian | 2901225 | 19947 | big |
| 1 | Kuyavian-Pomeranian | 2077775 | 17972 | medium |
| 2 | Lublin | 2117619 | 25122 | medium |
| 3 | Lubusz | 1014548 | 13988 | small |
| 4 | Lodzkie | 2466322 | 18219 | big |
# Creating segments according to four categories of provinces area.
# Data is used then to compare population segment with area segment to explore potential relationships.
def seg_area(area):
if area < 14885:
return 'small'
elif area >= 14885 and area < 18271:
return 'medium'
elif area >= 18271 and area < 23217:
return 'big'
else:
return 'very big'
seg_area(30)
'small'
data2['area_segment'] = data2['area_km2'].apply(seg_area)
data2.head()
| province | population | area_km2 | population_segment | area_segment | |
|---|---|---|---|---|---|
| 0 | Lower Silesian | 2901225 | 19947 | big | big |
| 1 | Kuyavian-Pomeranian | 2077775 | 17972 | medium | medium |
| 2 | Lublin | 2117619 | 25122 | medium | very big |
| 3 | Lubusz | 1014548 | 13988 | small | small |
| 4 | Lodzkie | 2466322 | 18219 | big | medium |
# Previewing segment with provinces with the biggest population, including descending sorting order.
data2.loc[data2['population_segment'] == 'very big']\
.reset_index()\
.sort_values(['population'],
ascending = False)
| index | province | population | area_km2 | population_segment | area_segment | |
|---|---|---|---|---|---|---|
| 1 | 6 | Masovia | 5403412 | 35558 | very big | very big |
| 2 | 11 | Silesian | 4533565 | 12333 | very big | small |
| 3 | 14 | Greater Poland | 3493969 | 29826 | very big | very big |
| 0 | 5 | Lesser Poland | 3400577 | 15183 | very big | medium |
# Grouping data related to registered and detected crimes in 2020: registered, detected, detections.
# Adding it to the dataset with population and area, broken by provinces, as seperate columns.
year_2020 = data_groupby_year.get_group(2020)
data2[['crime_registered', 'crime_detected', 'detection_rate']] =\
year_2020[['crime_registered', 'crime_detected', 'detection_rate']].values
data2
| province | population | area_km2 | population_segment | area_segment | crime_registered | crime_detected | detection_rate | |
|---|---|---|---|---|---|---|---|---|
| 0 | Lower Silesian | 2901225 | 19947 | big | big | 69.0 | 68.0 | 99.0 |
| 1 | Kuyavian-Pomeranian | 2077775 | 17972 | medium | medium | 27.0 | 25.0 | 93.0 |
| 2 | Lublin | 2117619 | 25122 | medium | very big | 23.0 | 23.0 | 100.0 |
| 3 | Lubusz | 1014548 | 13988 | small | small | 26.0 | 26.0 | 100.0 |
| 4 | Lodzkie | 2466322 | 18219 | big | medium | 44.0 | 43.0 | 98.0 |
| 5 | Lesser Poland | 3400577 | 15183 | very big | medium | 36.0 | 35.0 | 97.0 |
| 6 | Masovia | 5403412 | 35558 | very big | very big | 107.0 | 105.0 | 98.0 |
| 7 | Opole | 986506 | 9412 | small | small | 8.0 | 8.0 | 100.0 |
| 8 | Subcarpathian | 2129015 | 17846 | big | medium | 11.0 | 11.0 | 100.0 |
| 9 | Podlaskie | 1181533 | 20187 | small | big | 15.0 | 15.0 | 100.0 |
| 10 | Pomerania | 2333523 | 18321 | big | big | 52.0 | 51.0 | 98.0 |
| 11 | Silesian | 4533565 | 12333 | very big | small | 83.0 | 82.0 | 99.0 |
| 12 | Świętokrzyskie | 1241546 | 11711 | small | small | 19.0 | 19.0 | 100.0 |
| 13 | Warmian-Masurian | 1428983 | 24173 | medium | very big | 24.0 | 24.0 | 100.0 |
| 14 | Greater Poland | 3493969 | 29826 | very big | very big | 33.0 | 30.0 | 91.0 |
| 15 | West Pomerania | 1701030 | 22897 | medium | big | 65.0 | 65.0 | 100.0 |
# CHART 9.1.
# Generating interactive bubble chart visualization, presenting number of murders in each province in 2020.
# The chart is aimed at showing relation between population and area of the particular province.
pio.renderers.default = 'notebook'
plt.figure(figsize=(22, 18))
fig = px.scatter(data2,
x='population',
y='area_km2',
color='province',
size='crime_registered',
hover_data=['province'],
labels={
'population': 'Population in millions',
'area_km2': 'Area in km2',
'province': 'Province'},
title = '<b>Population and area of the provinces in 2020<b>',
width=1000,
height=600)
fig.update_layout(
title={
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
size = 13),
hoverlabel=dict(
bgcolor='white',
font_size=12))
fig.show()
<Figure size 1584x1296 with 0 Axes>
# Calculationg crime rate using population variable. Multyplying the division result by 100 000.
# This provides information about number of crimes per 100.000 residents of the province.
crime_population_rate = (data2['crime_registered'] / data2['population']*100000).round(2)
crime_population_rate
0 2.38 1 1.30 2 1.09 3 2.56 4 1.78 5 1.06 6 1.98 7 0.81 8 0.52 9 1.27 10 2.23 11 1.83 12 1.53 13 1.68 14 0.94 15 3.82 dtype: float64
# Calculationg crime rate using area variable. Multyplying the division result by 1000.
# This provides information about number of crimes per 1000 km2 of the province.
crime_area_rate = (data2['crime_registered'] / data2['area_km2']*1000).round(2)
crime_area_rate
0 3.46 1 1.50 2 0.92 3 1.86 4 2.42 5 2.37 6 3.01 7 0.85 8 0.62 9 0.74 10 2.84 11 6.73 12 1.62 13 0.99 14 1.11 15 2.84 dtype: float64
# Additional calculation covers population density, with the usage of poulation and area variables.
# Density rate represents number of residents per 1 km2 of the province.
pop_density_rate = (data2['population'] / data2['area_km2']).round()
pop_density_rate
0 145.0 1 116.0 2 84.0 3 73.0 4 135.0 5 224.0 6 152.0 7 105.0 8 119.0 9 59.0 10 127.0 11 368.0 12 106.0 13 59.0 14 117.0 15 74.0 dtype: float64
# Adding all calculated data as seperate columns. This way the comparison is more intuitive and indicate some relations.
# The numbers reveal that the volume of murders comitted itself is not enough to judge about the danger level in province.
data2['crime_100-000'] = crime_population_rate
data2['crime_1000km2'] = crime_area_rate
data2['population_density'] = pop_density_rate
data2.reset_index().sort_values(['population'], ascending = False)
| index | province | population | area_km2 | population_segment | area_segment | crime_registered | crime_detected | detection_rate | crime_100-000 | crime_1000km2 | population_density | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 6 | Masovia | 5403412 | 35558 | very big | very big | 107.0 | 105.0 | 98.0 | 1.98 | 3.01 | 152.0 |
| 11 | 11 | Silesian | 4533565 | 12333 | very big | small | 83.0 | 82.0 | 99.0 | 1.83 | 6.73 | 368.0 |
| 14 | 14 | Greater Poland | 3493969 | 29826 | very big | very big | 33.0 | 30.0 | 91.0 | 0.94 | 1.11 | 117.0 |
| 5 | 5 | Lesser Poland | 3400577 | 15183 | very big | medium | 36.0 | 35.0 | 97.0 | 1.06 | 2.37 | 224.0 |
| 0 | 0 | Lower Silesian | 2901225 | 19947 | big | big | 69.0 | 68.0 | 99.0 | 2.38 | 3.46 | 145.0 |
| 4 | 4 | Lodzkie | 2466322 | 18219 | big | medium | 44.0 | 43.0 | 98.0 | 1.78 | 2.42 | 135.0 |
| 10 | 10 | Pomerania | 2333523 | 18321 | big | big | 52.0 | 51.0 | 98.0 | 2.23 | 2.84 | 127.0 |
| 8 | 8 | Subcarpathian | 2129015 | 17846 | big | medium | 11.0 | 11.0 | 100.0 | 0.52 | 0.62 | 119.0 |
| 2 | 2 | Lublin | 2117619 | 25122 | medium | very big | 23.0 | 23.0 | 100.0 | 1.09 | 0.92 | 84.0 |
| 1 | 1 | Kuyavian-Pomeranian | 2077775 | 17972 | medium | medium | 27.0 | 25.0 | 93.0 | 1.30 | 1.50 | 116.0 |
| 15 | 15 | West Pomerania | 1701030 | 22897 | medium | big | 65.0 | 65.0 | 100.0 | 3.82 | 2.84 | 74.0 |
| 13 | 13 | Warmian-Masurian | 1428983 | 24173 | medium | very big | 24.0 | 24.0 | 100.0 | 1.68 | 0.99 | 59.0 |
| 12 | 12 | Świętokrzyskie | 1241546 | 11711 | small | small | 19.0 | 19.0 | 100.0 | 1.53 | 1.62 | 106.0 |
| 9 | 9 | Podlaskie | 1181533 | 20187 | small | big | 15.0 | 15.0 | 100.0 | 1.27 | 0.74 | 59.0 |
| 3 | 3 | Lubusz | 1014548 | 13988 | small | small | 26.0 | 26.0 | 100.0 | 2.56 | 1.86 | 73.0 |
| 7 | 7 | Opole | 986506 | 9412 | small | small | 8.0 | 8.0 | 100.0 | 0.81 | 0.85 | 105.0 |
# Creating separate dataframe, with chosen variables to use within correlation exploration.
# The correlation analysis is focused in checking relationships between crime_registered and detection_rate (dependent)
# and set of newly defined variables (independent): population, area, population density.
data3 = data2[['province', 'population', 'area_km2','crime_registered', 'crime_detected', 'detection_rate', 'population_density']]
data3
| province | population | area_km2 | crime_registered | crime_detected | detection_rate | population_density | |
|---|---|---|---|---|---|---|---|
| 0 | Lower Silesian | 2901225 | 19947 | 69.0 | 68.0 | 99.0 | 145.0 |
| 1 | Kuyavian-Pomeranian | 2077775 | 17972 | 27.0 | 25.0 | 93.0 | 116.0 |
| 2 | Lublin | 2117619 | 25122 | 23.0 | 23.0 | 100.0 | 84.0 |
| 3 | Lubusz | 1014548 | 13988 | 26.0 | 26.0 | 100.0 | 73.0 |
| 4 | Lodzkie | 2466322 | 18219 | 44.0 | 43.0 | 98.0 | 135.0 |
| 5 | Lesser Poland | 3400577 | 15183 | 36.0 | 35.0 | 97.0 | 224.0 |
| 6 | Masovia | 5403412 | 35558 | 107.0 | 105.0 | 98.0 | 152.0 |
| 7 | Opole | 986506 | 9412 | 8.0 | 8.0 | 100.0 | 105.0 |
| 8 | Subcarpathian | 2129015 | 17846 | 11.0 | 11.0 | 100.0 | 119.0 |
| 9 | Podlaskie | 1181533 | 20187 | 15.0 | 15.0 | 100.0 | 59.0 |
| 10 | Pomerania | 2333523 | 18321 | 52.0 | 51.0 | 98.0 | 127.0 |
| 11 | Silesian | 4533565 | 12333 | 83.0 | 82.0 | 99.0 | 368.0 |
| 12 | Świętokrzyskie | 1241546 | 11711 | 19.0 | 19.0 | 100.0 | 106.0 |
| 13 | Warmian-Masurian | 1428983 | 24173 | 24.0 | 24.0 | 100.0 | 59.0 |
| 14 | Greater Poland | 3493969 | 29826 | 33.0 | 30.0 | 91.0 | 117.0 |
| 15 | West Pomerania | 1701030 | 22897 | 65.0 | 65.0 | 100.0 | 74.0 |
# CHART 9.2.
# Building correlation matrix in order to see the strongest relationship between the chosen variables.
plt.figure(figsize = (11,10), dpi = 90)
sns.heatmap(data3.corr(),
xticklabels = data3.corr().columns,
yticklabels = data3.corr().columns,
cmap = 'winter',
center = 0,
annot = True)
plt.title('Correlation map with chosen variables',
fontsize = 15,
fontweight = 'bold',
pad = 40)
plt.xticks(fontsize = 11,
fontweight = 'bold')
plt.yticks(fontsize = 11)
plt.tight_layout()
plt.show()
# METHOD 1: genarating p-value mapping, marking with '*'
# Each variable is geeting marked in order to display if the range of 0.01-0.1 statistical significance is met.
from scipy.stats import pearsonr
rho = data3.corr()
pval = data3.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
p = pval.applymap(lambda x: ''.join(['*' for t in [0.01,0.05,0.1] if x<=t]))
rho.round(2).astype(str) + p
| population | area_km2 | crime_registered | crime_detected | detection_rate | population_density | |
|---|---|---|---|---|---|---|
| population | 1.0*** | 0.48* | 0.8*** | 0.79*** | -0.37 | 0.7*** |
| area_km2 | 0.48* | 1.0*** | 0.43* | 0.42 | -0.32 | -0.25 |
| crime_registered | 0.8*** | 0.43* | 1.0*** | 1.0*** | -0.06 | 0.52** |
| crime_detected | 0.79*** | 0.42 | 1.0*** | 1.0*** | -0.03 | 0.52** |
| detection_rate | -0.37 | -0.32 | -0.06 | -0.03 | 1.0*** | -0.13 |
| population_density | 0.7*** | -0.25 | 0.52** | 0.52** | -0.13 | 1.0*** |
# METHOD 2: getting p-values per each pair of the variables.
def pearsonr_pval(x,y):
return pearsonr(x,y)[1]
corr = data3.corr(method=pearsonr_pval)
corr
| population | area_km2 | crime_registered | crime_detected | detection_rate | population_density | |
|---|---|---|---|---|---|---|
| population | 1.000000 | 0.060837 | 2.163474e-04 | 3.118559e-04 | 0.152622 | 0.002506 |
| area_km2 | 0.060837 | 1.000000 | 9.802036e-02 | 1.071582e-01 | 0.222672 | 0.356061 |
| crime_registered | 0.000216 | 0.098020 | 1.000000e+00 | 1.137996e-22 | 0.823096 | 0.037853 |
| crime_detected | 0.000312 | 0.107158 | 1.137996e-22 | 1.000000e+00 | 0.908343 | 0.039344 |
| detection_rate | 0.152622 | 0.222672 | 8.230961e-01 | 9.083428e-01 | 1.000000 | 0.630942 |
| population_density | 0.002506 | 0.356061 | 3.785298e-02 | 3.934443e-02 | 0.630942 | 1.000000 |
# METHOD 3: getting coefficient and p-values per chosen pairs of variables.
# Population
corr_pop = pd.DataFrame(columns=['r','p'])
for col in data3:
if pd.api.types.is_numeric_dtype(data3[col]) and col != 'population':
r, p = stats.pearsonr(data3.population, data3[col])
corr_pop.loc[col] = [round(r,3), round(p,3)]
# Density
corr_den = pd.DataFrame(columns=['r','p'])
for col in data3:
if pd.api.types.is_numeric_dtype(data3[col]) and col != 'population_density':
r, p = stats.pearsonr(data3.population_density, data3[col])
corr_den.loc[col] = [round(r,3), round(p,3)]
# Detection rate
corr_dtrate = pd.DataFrame(columns=['r','p'])
for col in data3:
if pd.api.types.is_numeric_dtype(data3[col]) and col != 'detection_rate':
r, p = stats.pearsonr(data3.detection_rate, data3[col])
corr_dtrate.loc[col] = [round(r,3), round(p,3)]
print('Population', '\r\n', corr_pop, '\r\n')
print('\r\n', 'Density', '\r\n', corr_den, '\r\n')
print('\r\n', 'Detection rate', '\r\n', corr_dtrate)
Population
r p
area_km2 0.478 0.061
crime_registered 0.797 0.000
crime_detected 0.785 0.000
detection_rate -0.375 0.153
population_density 0.701 0.003
Density
r p
population 0.701 0.003
area_km2 -0.247 0.356
crime_registered 0.523 0.038
crime_detected 0.519 0.039
detection_rate -0.130 0.631
Detection rate
r p
population -0.375 0.153
area_km2 -0.323 0.223
crime_registered -0.061 0.823
crime_detected -0.031 0.908
population_density -0.130 0.631
# CHART 9.3.
# Veryfying distribution of the values by applying linear regression plot with chosen variables.
# According to the results given above, exploring the strongest positive correaltions (coefficient values): population, density.
plt.style.use('seaborn')
sns.set(font_scale = 1.5)
sns.lmplot(x = 'population',
y = 'crime_registered',
data = data2,
line_kws = {'color': 'orchid'},
scatter_kws = {'color': 'navy'},
height = 7,
aspect = 2)
plt.title('Number of murders against population (2020)',
fontsize = 16,
fontweight = 'bold',
pad = 20)
plt.xlabel('Population of the provinces',
fontsize = 17,
labelpad = 20)
plt.ylabel('Number of murders',
fontsize = 17,
labelpad = 20)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.margins(0.1)
sns.lmplot(x = 'population_density',
y = 'crime_registered',
data = data2,
line_kws = {'color': 'lime'},
scatter_kws = {'color': 'black'},
height = 7,
aspect = 2)
plt.title('Number of murders against population_density (2020)',
fontsize = 16,
fontweight = 'bold',
pad = 20)
plt.xlabel('Population density: 1000 per km2',
fontsize = 17,
labelpad = 20)
plt.ylabel('Number of murders',
fontsize = 17,
labelpad = 20)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.margins(0.1)
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>
# CHART 9.4.
# Veryfying distribution of the values by applying linear regression plot with chosen variables.
# According to the results given above, exploring the strongest negative correaltions (coefficient values): population, area.
plt.style.use('seaborn')
sns.set(font_scale = 1.5)
sns.lmplot(x = 'detection_rate',
y = 'population',
data = data2,
line_kws = {'color': 'coral'},
scatter_kws = {'color': 'darkgreen'},
height = 7,
aspect = 2)
plt.title('Detection rate against population (2020)',
fontsize = 16,
fontweight = 'bold',
pad = 20)
plt.xlabel('Detection rate of murders',
fontsize = 17,
labelpad = 20)
plt.ylabel('Population of the provinces',
fontsize = 17,
labelpad = 20)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.margins(0.1)
sns.lmplot(x = 'detection_rate',
y = 'area_km2',
data = data2,
line_kws={'color': 'yellow'},
scatter_kws = {'color': 'maroon'},
height = 7,
aspect = 2)
plt.title('Detection rate against the area in km2 (2020)',
fontsize = 16,
fontweight = 'bold',
pad = 20)
plt.xlabel('Detection rate of murders',
fontsize = 17,
labelpad = 20)
plt.ylabel('Area of the provinces in km2',
fontsize = 17,
labelpad = 20)
plt.xticks(fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.margins(0.1)
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>
# In order to verify the above results related to detection rate, in detail per province:
# 50 first records with the highest detection rate are genearted and summed up per province.
# These are the provinces with the smallest area or the ratio of population to area - equal to greater population density.
data.nlargest(50, 'detection_rate').groupby('administrative_unit').count()
| year | crime_registered | crime_detected | detection_rate | |
|---|---|---|---|---|
| administrative_unit | ||||
| Kuyavian-Pomeranian | 11 | 11 | 11 | 11 |
| Lesser Poland | 4 | 4 | 4 | 4 |
| Lodzkie | 5 | 5 | 5 | 5 |
| Lower Silesian | 4 | 4 | 4 | 4 |
| Lublin | 5 | 5 | 5 | 5 |
| Lubusz | 7 | 7 | 7 | 7 |
| Opole | 10 | 10 | 10 | 10 |
| Subcarpathian | 4 | 4 | 4 | 4 |
---- Back to top ----